JVM Thread Dump Analysis

Goal: Gain insight into thread states in a production environment.

Inspired by: https://github.com/jakevdp/JupyterWorkflow


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import pandas as pd
plt.style.use('seaborn')  # named 'seaborn-v0_8' in matplotlib 3.6 and later

from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from mpl_toolkits.mplot3d import Axes3D

In [2]:
import jvmthreadparser.parser as jtp

Get Data

Dumps were generated every 2 minutes and saved in a single file. Period: May 2017.


In [3]:
dump = jtp.open_text('threads4.txt', load_thread_content = False)

In [4]:
dump.head()


Out[4]:
   DateTime             State
0  2017-04-30 01:02:01  RUNNABLE
1  2017-04-30 01:02:01  WAITING (ON OBJECT MONITOR)
2  2017-04-30 01:02:01  TIMED_WAITING (PARKING)
3  2017-04-30 01:02:01  TIMED_WAITING (PARKING)
4  2017-04-30 01:02:01  TIMED_WAITING (PARKING)
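
The internals of `jtp.open_text` are not shown in this notebook. As a rough sketch of how a frame like the one above could be built by hand, the snippet below assumes that each dump in the file starts with a `YYYY-MM-DD HH:MM:SS` timestamp line and that each thread reports a `java.lang.Thread.State:` line. The `parse_states` helper and the sample text are hypothetical and are not part of `jvmthreadparser`.

```python
import re
import pandas as pd

# Assumed layout: a timestamp line opens each dump, then one State line per thread.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")
STATE = re.compile(r"^\s*java\.lang\.Thread\.State: (.+)$")

def parse_states(text):
    """Return one row per thread: the timestamp of its dump and its state."""
    rows, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if TIMESTAMP.match(stripped):
            current = pd.Timestamp(stripped)  # start of a new dump
            continue
        match = STATE.match(line)
        if match and current is not None:
            # Upper-case the state to match the labels used in this notebook.
            rows.append((current, match.group(1).strip().upper()))
    return pd.DataFrame(rows, columns=["DateTime", "State"])

# Made-up sample text, for illustration only.
sample = """2017-04-30 01:02:01
Full thread dump (hypothetical sample):

"main" #1 prio=5
   java.lang.Thread.State: RUNNABLE

"pool-1-thread-1" #12 prio=5
   java.lang.Thread.State: TIMED_WAITING (parking)
"""
parsed = parse_states(sample)
print(parsed)
```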

Thread State by Date

  • Problem: How many threads exist in each state?
  • Goal: Reshape the data using the date as index and states as columns.
  • How:

In [5]:
dump['Threads'] = 1
threads_by_state = dump.groupby(['DateTime','State']).count().unstack().fillna(0)
threads_by_state.columns = threads_by_state.columns.droplevel()
threads_by_state.head()


Out[5]:
State                BLOCKED (ON OBJECT MONITOR)  RUNNABLE  TERMINATED  TIMED_WAITING (ON OBJECT MONITOR)  TIMED_WAITING (PARKING)  TIMED_WAITING (SLEEPING)  WAITING (ON OBJECT MONITOR)  WAITING (PARKING)
DateTime
2017-04-30 01:02:01  0.0                          27.0      0.0         54.0                               136.0                    3.0                       4.0                          0.0
2017-04-30 01:04:01  0.0                          31.0      0.0         54.0                               131.0                    3.0                       4.0                          0.0
2017-04-30 01:06:01  0.0                          26.0      0.0         55.0                               113.0                    3.0                       4.0                          10.0
2017-04-30 01:08:01  0.0                          25.0      0.0         55.0                               114.0                    3.0                       4.0                          10.0
2017-04-30 01:10:01  0.0                          28.0      0.0         54.0                               112.0                    3.0                       4.0                          10.0
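
As a cross-check of the reshape, `pd.crosstab` produces an equivalent date-by-state count table in one call, with integer counts and zeros for combinations that never occur. The toy frame below is made up for illustration and is not taken from the real dump.

```python
import pandas as pd

# Made-up rows: three threads in the first dump, two in the second.
toy = pd.DataFrame({
    "DateTime": pd.to_datetime(
        ["2017-04-30 01:02:01"] * 3 + ["2017-04-30 01:04:01"] * 2
    ),
    "State": ["RUNNABLE", "RUNNABLE", "WAITING (PARKING)", "RUNNABLE", "TERMINATED"],
})

# Rows are dump timestamps, columns are states, cells are thread counts.
counts = pd.crosstab(toy["DateTime"], toy["State"])
print(counts)
```

The only difference from the `groupby`/`unstack` version is the dtype: `crosstab` keeps integers, while `unstack().fillna(0)` yields floats.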

In [6]:
ax = threads_by_state.plot(figsize=(14,12), cmap='Paired', title = 'Thread State by Date')
ax.set_xlabel('Day of Month')
ax.set_ylabel('Number of Threads');


Average of Threads by Hour

  • Problem: Is there a peak hour for any thread state?
  • Goal: Plot thread states by hour (0-23).
  • How:

In [7]:
ax = threads_by_state.groupby(threads_by_state.index.hour).mean().plot(figsize=(14,12), cmap='Paired', title='Threads by Hour')
ax.set_xlabel('Hour of the Day (0-23)')
ax.set_ylabel('Mean(Number of Threads)');


Average of Threads by Day

  • Problem: Is there a peak day for any thread state?
  • Goal: Plot thread states by day (2017-05-01 to 2017-05-29).
  • How:

In [8]:
ax = threads_by_state.resample('D').mean().plot(figsize=(14,12), cmap = 'Paired')
ax.set_xlabel('Day of Month')
ax.set_ylabel('Mean(Number of Threads)');


Threads in TIMED_WAITING (PARKING) by Hour Each Day

  • Problem: Is there any pattern in TIMED_WAITING (PARKING) threads?
  • Goal: Plot TIMED_WAITING (PARKING) threads by time of day, with one line per day, so that recurring daily patterns become visible.
  • How:

In [9]:
by_hour = threads_by_state.resample('H').mean()
pivoted = by_hour.pivot_table("TIMED_WAITING (PARKING)", index = by_hour.index.time, columns = by_hour.index.date).fillna(0)
ax = pivoted.plot(legend=False, alpha = 0.3, color = 'black', title = 'Day Patterns of TIMED_WAITING (PARKING) Threads by Time', figsize=(14,12))
ax.set_xlabel('Time')
ax.set_ylabel('Number of Threads');
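
To make the shape of `pivoted` concrete, here is a minimal sketch on synthetic 2-minute samples (made-up values, two days only). It skips the explicit `resample` step because `pivot_table` averages duplicate cells by default, so the hourly mean falls out of the pivot itself. The names `toy` and `pivoted_toy` are illustrative.

```python
import pandas as pd

# Two full days of 2-minute samples; the value is 100 + hour, so the
# hourly mean for hour h is exactly 100 + h.
idx = pd.date_range("2017-05-01", periods=2 * 24 * 30, freq="2min")
toy = pd.DataFrame({"TIMED_WAITING (PARKING)": idx.hour.to_numpy() + 100.0}, index=idx)

# Rows: hour of day (0-23). Columns: calendar date. Cells: mean thread count.
pivoted_toy = toy.pivot_table(
    "TIMED_WAITING (PARKING)",
    index=toy.index.hour.to_numpy(),
    columns=toy.index.date,
)

# Transposing gives one row per day and one column per hour,
# which is the layout used for the PCA input.
X_toy_days = pivoted_toy.T.values
print(pivoted_toy.shape, X_toy_days.shape)
```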


Principal Component Analysis

  • Problem: Can we visualize clusters among the daily patterns?
  • Goal: Use PCA to reduce each day's 24 hourly values to 3 dimensions.
  • How:

In [10]:
X = pivoted.fillna(0).T.values
X.shape


Out[10]:
(30, 24)

In [11]:
X2 = PCA(3, svd_solver='full').fit_transform(X)
X2.shape


Out[11]:
(30, 3)
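
The notebook does not report how much variance the 3 components retain. The sketch below shows one way to check this, computing PCA by hand with NumPy's SVD on a synthetic stand-in for `X` (made-up data, same 30 x 24 shape). With scikit-learn, the same ratios are available as `explained_variance_ratio_` on a fitted `PCA` object.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for X: 30 "days" x 24 "hours", generated from
# 2 latent daily patterns plus a little noise (an assumption for the demo).
latent = rng.normal(size=(30, 2))
patterns = rng.normal(size=(2, 24))
X_toy = latent @ patterns + 0.05 * rng.normal(size=(30, 24))

# PCA by hand: center each column, then take the SVD.
Xc = X_toy - X_toy.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = s**2 / np.sum(s**2)  # fraction of total variance per component
X2_toy = Xc @ Vt[:3].T           # projection onto the first 3 components

print(explained[:3], explained[:3].sum())
```

The projection matches what `PCA(3).fit_transform` returns up to the sign of each component. If the first three ratios sum to much less than 1 on the real data, the 3D scatter plot hides a meaningful share of the day-to-day variation.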

In [12]:
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X2[:, 0], X2[:, 1], X2[:, 2])
ax.set_title('PCA Dimensionality Reduction (3 Dimensions)')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3');


Unsupervised Clustering

  • Problem: Can we color the points to identify each cluster?
  • Goal: Use GaussianMixture to identify clusters.
  • How:

In [13]:
gmm = GaussianMixture(3).fit(X)
labels = gmm.predict(X)
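
The choice of 3 components is not justified in the notebook. One common check, sketched here on synthetic blobs rather than on the real `X`, is to fit mixtures with different component counts and compare their BIC scores (lower is better). The data, the range of `k`, and the fixed `random_state` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in: three well-separated blobs in 3 dimensions (made-up data).
centers = np.array([[0, 0, 0], [8, 8, 0], [0, 8, 8]], dtype=float)
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(100, 3)) for c in centers])

# Fit one mixture per candidate component count and record its BIC.
bic = {}
for k in range(1, 7):
    gmm_k = GaussianMixture(n_components=k, random_state=0).fit(X_toy)
    bic[k] = gmm_k.bic(X_toy)

best_k = min(bic, key=bic.get)  # component count with the lowest BIC
print(best_k, bic)
```

Setting `random_state` also makes the cluster numbering reproducible between runs, which the unseeded `GaussianMixture(3)` call does not guarantee.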

In [14]:
fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(111, projection='3d')

cMap = ListedColormap(['green', 'blue','red'])
p = ax.scatter(X2[:, 0], X2[:, 1], X2[:, 2], c=labels, cmap=cMap)
ax.set_title('Unsupervised Clustering (3 Clusters with Colors)')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3');
colorbar = fig.colorbar(p, ticks=np.linspace(0,2,3))
colorbar.set_label('Cluster')


Visualizing Clustering

  • Problem: How can we identify the thread patterns in each cluster?
  • Goal: Generate one plot per cluster showing the daily TIMED_WAITING (PARKING) curves assigned to it.
  • How:

In [15]:
fig, ax = plt.subplots(1, 3, figsize=(14, 6))

pivoted.T[labels == 0].T.plot(legend=False, alpha=0.4, ax=ax[0])
pivoted.T[labels == 1].T.plot(legend=False, alpha=0.4, ax=ax[1])
pivoted.T[labels == 2].T.plot(legend=False, alpha=0.4, ax=ax[2])

ax[0].set_title('Cluster 0')
ax[0].set_xlabel('Time')
ax[0].set_ylabel('Number of Threads')
ax[1].set_title('Cluster 1')
ax[1].set_xlabel('Time')
ax[2].set_title('Cluster 2')
ax[2].set_xlabel('Time');


Comparing with Day of Week

  • Problem: Can the weekday explain this variability?
  • Goal: Plot clusters using one color per weekday (Monday=0, Sunday=6).
  • How:

In [16]:
dayofweek = pd.DatetimeIndex(pivoted.columns).dayofweek

In [17]:
fig = plt.figure(figsize=(14, 10))
ax = fig.add_subplot(111, projection='3d')
p = ax.scatter(X2[:, 0], X2[:, 1],  X2[:, 2], c=dayofweek, cmap='rainbow')
ax.set_title('Unsupervised Clustering (3 Clusters) Colored by Weekday')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3');
colorbar = fig.colorbar(p)
colorbar.set_label('Weekday (0=Monday, 6=Sunday)')
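
A scatter plot makes the weekday comparison visual. A contingency table of cluster against weekday makes it countable. The sketch below uses hypothetical stand-ins for `labels` and `pivoted.columns`, with weekends deliberately assigned to their own cluster so the pattern is easy to read.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `pivoted.columns` and `labels`.
dates = pd.date_range("2017-05-01", periods=14, freq="D")  # starts on a Monday
toy_labels = np.where(dates.dayofweek >= 5, 1, 0)          # pretend weekends form cluster 1

# Rows: cluster label. Columns: weekday (0=Monday, 6=Sunday). Cells: number of days.
table = pd.crosstab(
    pd.Series(toy_labels, name="Cluster"),
    pd.Series(dates.dayofweek, name="Weekday"),
)
print(table)
```

On the real data, a cluster whose counts concentrate in columns 5 and 6 would suggest that the weekday explains part of the variability.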